Health monitor #112

Merged
merged 42 commits into from
Apr 23, 2024


breuleux
Member

New commands

# Loop through all checks
sarc health check --config health-monitor.yaml
sarc --color -vvv health check --config health-monitor.yaml   # Looks better

# Run a single check
sarc health check --config health-monitor.yaml --name check_name

# Run health monitor
sarc health monitor --config health-monitor.yaml
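
The commands above can also be driven from a script. The following is a minimal sketch, assuming `sarc` is installed and on the `PATH`; the helper names `health_check_argv` and `run_health_check` are hypothetical and not part of this PR. It only uses the flags shown above, and it makes no assumption about what the exit code of `sarc health check` means.

```python
import subprocess


def health_check_argv(config, name=None, verbose=False):
    """Build the argument list for `sarc health check` (hypothetical helper)."""
    argv = ["sarc"]
    if verbose:
        # Global options go before the subcommand, as in the example above.
        argv += ["--color", "-vvv"]
    argv += ["health", "check", "--config", config]
    if name is not None:
        # Restrict the run to a single named check.
        argv += ["--name", name]
    return argv


def run_health_check(config, name=None, verbose=False):
    """Run the command and return its exit code, without interpreting it."""
    completed = subprocess.run(health_check_argv(config, name, verbose), check=False)
    return completed.returncode
```

For instance, `run_health_check("health-monitor.yaml", name="check_name")` would run the single-check command shown above.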

Example configuration

sarc:
  health_monitor:
    directory: check-results

    parameterizations:
      cluster_name:
        - mila
        - narval

    checks:
      at_least_one_{cluster_name}:
        class: sarc.alerts.checks:FilterCheck
        active: true
        interval: 1h
        min_count: 1
        filters:
          - cluster_name == params.cluster_name
          - ran_for(hours=5)
          - allocated_gpu()

      has_gpu_util_{cluster_name}:
        class: sarc.alerts.checks:FilterCheck
        active: false
        interval: 1s
        min_count: 1
        filters:
          - cluster_name == params.cluster_name
          - ran_for(hours=5)
          - stored_statistics?.gpu_utilization?.mean > 0
        depends: "at_least_one_{cluster_name}"

# Use this to test at a fixed point in time; every sleep then takes 0.5 seconds
time:
  class: FrozenTime
  time: "2023-10-08T15:47:47Z"
  sleep_beat: 0.5
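
To illustrate how the `parameterizations` section and the `{cluster_name}` placeholders in the check names might fit together, here is a minimal sketch. It assumes that each check template is instantiated once per combination of parameter values, that `depends` is formatted with the same parameters, and that intervals such as `1h` or `1s` are a number followed by a one-letter unit. `parse_interval` and `expand_checks` are hypothetical helpers written for this illustration, not the code from this PR.

```python
from datetime import timedelta
from itertools import product

# Assumed unit suffixes; only "h" and "s" appear in the example configuration.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_interval(text):
    """Turn a string such as "1h" or "30s" into a timedelta."""
    value, unit = text[:-1], text[-1]
    return timedelta(**{_UNITS[unit]: int(value)})


def expand_checks(parameterizations, checks):
    """Instantiate every check template for every combination of parameters."""
    keys = list(parameterizations)
    expanded = {}
    for values in product(*(parameterizations[k] for k in keys)):
        params = dict(zip(keys, values))
        for template, settings in checks.items():
            concrete = dict(settings)
            if "depends" in concrete:
                # The dependency name is a template too.
                concrete["depends"] = concrete["depends"].format(**params)
            concrete["interval"] = parse_interval(concrete["interval"])
            concrete["params"] = params
            expanded[template.format(**params)] = concrete
    return expanded


# A trimmed-down copy of the example configuration, as plain Python data.
parameterizations = {"cluster_name": ["mila", "narval"]}
checks = {
    "at_least_one_{cluster_name}": {"active": True, "interval": "1h"},
    "has_gpu_util_{cluster_name}": {
        "active": False,
        "interval": "1s",
        "depends": "at_least_one_{cluster_name}",
    },
}

expanded = expand_checks(parameterizations, checks)
for name, settings in expanded.items():
    print(name, settings["interval"], settings.get("depends"))
```

Under these assumptions, the two templates and two cluster names yield four concrete checks, and each `has_gpu_util_*` check depends on the `at_least_one_*` check for the same cluster.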

@breuleux breuleux marked this pull request as ready for review April 2, 2024 20:29
@nurbal nurbal merged commit 1bb9f56 into mila-iqia:master Apr 23, 2024
7 checks passed